AITopics | policy-gradient method

Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Method

Neural Information Processing SystemsDec-24-2025, 09:52:23 GMT

Deep Neural Networks and Reinforcement Learning methods have empirically shown great promise in tackling challenging combinatorial problems. In those methods a deep neural network is used as a solution generator which is then trained by gradient-based methods (e.g., policy gradient) to successively obtain better solution distributions.In this work we introduce a novel theoretical framework for analyzing the effectiveness of such methods. We ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable, i.e, polynomial in the size of the input, number of parameters; (iii) their optimization landscape is benign in the sense that it does not contain sub-optimal stationary points. Our main contribution is a positive answer to this question.

combinatorial problem, name change, optimizing solution-sampler, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Unraveling the Rainbow: can value-based methods schedule?

Corrêa, Arthur, Jesus, Alexandre, Nascimento, Paulo, Silva, Cristóvão, Moniz, Samuel

arXiv.org Artificial IntelligenceDec-1-2025

In this work, we conduct an extensive empirical study of several deep reinforcement learning algorithms on two challenging combinatorial optimization problems: the job-shop and flexible job-shop scheduling problems, both fundamental challenges with multiple industrial applications. Broadly, deep reinforcement learning algorithms fall into two categories: policy-gradient and value-based. While value-based algorithms have achieved notable success in domains such as the Arcade Learning Environment, the combinatorial optimization community has predominantly favored policy-gradient algorithms, often overlooking the potential of value-based alternatives. From our results, value-based algorithms demonstrated a lower variance and a more stable convergence profile compared to policy-gradient ones. Moreover, they achieved superior cross-size and cross-distribution generalization, that is, effectively solving instances that are substantially larger or structurally distinct from those seen during training. Finally, our analysis also suggests that the relative performance of each category of algorithms may be dependent on structural properties of the problem, such as problem flexibility and instance size. Overall, our findings challenge the prevailing assumption that policy-gradient algorithms are inherently superior for combinatorial optimization. We show instead that value-based algorithms can match or even surpass the performance of policy-gradient algorithms, suggesting that they deserve greater attention from the combinatorial optimization community. Our code is openly available at: https://github.com/AJ-Correa/Unraveling-the-Rainbow

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2505.03323

Country: North America > United States > Texas (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Method

Neural Information Processing SystemsOct-10-2024, 20:43:13 GMT

Deep Neural Networks and Reinforcement Learning methods have empirically shown great promise in tackling challenging combinatorial problems. In those methods a deep neural network is used as a solution generator which is then trained by gradient-based methods (e.g., policy gradient) to successively obtain better solution distributions.In this work we introduce a novel theoretical framework for analyzing the effectiveness of such methods. We ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable, i.e, polynomial in the size of the input, number of parameters; (iii) their optimization landscape is benign in the sense that it does not contain sub-optimal stationary points. Our main contribution is a positive answer to this question. As a byproduct of our analysis we introduce a novel regularization process over vanilla gradient descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.

combinatorial problem, optimizing solution-sampler, policy-gradient method, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

Add feedback

Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems

Comte, Céline, Jonckheere, Matthieu, Sanders, Jaron, Senen-Cerda, Albert

arXiv.org Artificial IntelligenceDec-5-2023

Stochastic networks and queueing systems often lead to Markov decision processes (MDPs) with large state and action spaces as well as nonconvex objective functions, which hinders the convergence of many reinforcement learning (RL) algorithms. Policy-gradient methods perform well on MDPs with large state and action spaces, but they sometimes experience slow convergence due to the high variance of the gradient estimator. In this paper, we show that some of these difficulties can be circumvented by exploiting the structure of the underlying MDP. We first introduce a new family of gradient estimators called score-aware gradient estimators (SAGEs). When the stationary distribution of the MDP belongs to an exponential family parametrized by the policy parameters, SAGEs allow us to estimate the policy gradient without relying on value-function estimation, contrary to classical policy-gradient methods like actor-critic. To demonstrate their applicability, we examine two common control problems arising in stochastic networks and queueing systems whose stationary distributions have a product-form, a special case of exponential families. As a second contribution, we show that, under appropriate assumptions, the policy under a SAGE-based policy-gradient method has a large probability of converging to an optimal policy, provided that it starts sufficiently close to it, even with a nonconvex objective function and multiple maximizers. Our key assumptions are that, locally around a maximizer, a nondegeneracy property of the Hessian of the objective function holds and a Lyapunov function exists. Finally, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic algorithm. The results demonstrate that the SAGE-based method finds close-to-optimal policies more rapidly, highlighting its superior performance over the traditional actor-critic method.

algorithm, probability, stationary distribution, (16 more...)

arXiv.org Artificial Intelligence

2312.02804

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Add feedback

Learning to Team-Based Navigation: A Review of Deep Reinforcement Learning Techniques for Multi-Agent Pathfinding

Chung, Jaehoon, Fayyad, Jamil, Younes, Younes Al, Najjaran, Homayoun

arXiv.org Artificial IntelligenceAug-10-2023

Multi-agent pathfinding (MAPF) is a critical field in many large-scale robotic applications, often being the fundamental step in multi-agent systems. The increasing complexity of MAPF in complex and crowded environments, however, critically diminishes the effectiveness of existing solutions. In contrast to other studies that have either presented a general overview of the recent advancements in MAPF or extensively reviewed Deep Reinforcement Learning (DRL) within multi-agent system settings independently, our work presented in this review paper focuses on highlighting the integration of DRL-based approaches in MAPF. Moreover, we aim to bridge the current gap in evaluating MAPF solutions by addressing the lack of unified evaluation metrics and providing comprehensive clarification on these metrics. Finally, our paper discusses the potential of model-based DRL as a promising future direction and provides its required foundational understanding to address current challenges in MAPF. Our objective is to assist readers in gaining insight into the current research direction, providing unified metrics for comparing different MAPF algorithms and expanding their knowledge of model-based DRL to address the existing challenges in MAPF.

agent, algorithm, mapf, (15 more...)

arXiv.org Artificial Intelligence

2308.05893

Country:

North America > Canada > British Columbia > Regional District of Central Okanagan > Kelowna (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.67)

Industry:

Transportation (0.46)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

Add feedback

Policy-Gradient Methods for Planning

Neural Information Processing SystemsApr-6-2023, 15:32:04 GMT

Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning -- in the form of a policy-gradient method -- to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our ap- proach is to construct simple policies, or agents, for each planning task.

policy-gradient method

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.32)

Add feedback

Policy-Gradient Methods for Planning

Aberdeen, Douglas

Neural Information Processing SystemsDec-31-2006

Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning -- in the form of a policy-gradient method -- to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use.

agent, decision tree, probability, (17 more...)

Neural Information Processing Systems

Country: Oceania > Australia > Australian Capital Territory > Canberra (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Policy-Gradient Methods for Planning

Aberdeen, Douglas

Neural Information Processing SystemsDec-31-2006

Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning -- in the form of a policy-gradient method -- to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use.

agent, decision tree, probability, (17 more...)

Neural Information Processing Systems

Country: Oceania > Australia > Australian Capital Territory > Canberra (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Policy-Gradient Methods for Planning

Aberdeen, Douglas

Neural Information Processing SystemsDec-31-2006

Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning -- in the form of a policy-gradient method -- to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach isto construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use.

Add feedback

Filters

Collaborating Authors

policy-gradient method

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Method

Unraveling the Rainbow: can value-based methods schedule?

Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Method

Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems

Learning to Team-Based Navigation: A Review of Deep Reinforcement Learning Techniques for Multi-Agent Pathfinding

Policy-Gradient Methods for Planning

Policy-Gradient Methods for Planning

Policy-Gradient Methods for Planning

Policy-Gradient Methods for Planning